Design of an efficient fingerprint that detects homologous proteins at distant sequence identity has been a
great challenge. This paper proposes a strategy to extract an ideal-like fingerprint with high specificity and 
sensitivity from a group of sequences related to a fold. The approach is devised based on the assumptions 
that the critical residues for a protein fold may be conserved in three aspects, i.e. sequence, structure, and 
intramolecular interaction, and embedded in secondary structures. We hypothesized that the residues 
satisfying such conditions simultaneously may work as an efficient fingerprint. This idea was tested on 
protein folds of various classes, such as beta-strand rich, alpha + beta proteins and alpha/beta proteins with 
discrete sequence similarities. The fingerprint for each fold was generated by selecting the overlapped
conserved residues (OCR) from the conserved residues obtained using three independent alignment
methods, i.e. multiple sequence alignment, structure-based alignment, and alignment based on the
interstrand hydrogen bonds. The OCR fingerprints showed more than 90% detection efficiency for all the
folds tested and were identified to be almost the minimal fingerprints composed of only critical residues. 
This study is expected to provide an important conceptual improvement in the identification or design of 
ideal fingerprints for a protein fold. 

The exponential growth of protein sequence databases has motivated the development of various computational
approaches for recognizing structural/functional features and classifying uncharacterized
protein sequences 1-4. These methods basically utilize protein sequence patterns, or fingerprints, that
represent proteins with specific structures or functions 5-7. The patterns are generally generated by
aligning a group of sequences with a similar structure, function or family relationship. Three kinds of sequence
patterns have typically been used to tackle the relationship between protein sequences, structures and functions:
(i) small motifs (e.g. identified by PROSITE, Pratt, TRILOGY, etc.) are groups of conserved residues
identified from short conserved sequences in regions well known for substantial biological activity, such
as catalytic sites and metal ion binding sites 8-11; (ii) multiple motifs or blocks (e.g. identified by PRINTS, InterPro,
etc.) are groups of independent, sequentially or spatially distinct motifs that usually occur together and suggest
a putative function 12,13; and (iii) profiles or family signatures are generated using the level of amino acid
conservation at different positions in the alignment of a complete protein domain. PROSITE, HHpred, PSI-BLAST, etc.
are the tools used to identify such patterns 14-17. All of these patterns work as signatures to identify similar features
in uncharacterized sequences.

An ideal fingerprint for a given fold might be one that can detect all the homologous proteins with perfect 
sensitivity and exclude any non-homologous proteins with perfect specificity. Such a fingerprint should include 
the critical residues, which can detect all the homologous proteins, and not include any non-essential residues that 
can decrease the sensitivity. As mentioned above, many strategies were devised to identify such efficient sequence 
patterns and they were evaluated to be somewhat successful to characterize the protein sequences and structures. 
However, there are still some limitations in these fingerprints 18-20. For instance, small motifs for substantial
biological activity generally show high sensitivity, but low specificity in the detection of homologous sequences.
On the other hand, fingerprints such as blocks and profiles show high specificity, but relatively low sensitivity.
In particular, the sensitivities of most sequence patterns are not satisfactory when finding remote protein
homologs. Further studies are needed to develop more effective schemes that yield a
fingerprint close to the ideal.

We propose a new approach to generate an efficient fingerprint for the detection of protein homologs. The
approach was devised on the basis of the following assumptions. First, the crucial residues for a protein fold might be
conserved in three aspects, i.e. sequence, structure, and intramolecular interaction. Second, structurally
important residues may be embedded in the secondary structure elements, such as α-helices
and β-strands, rather than in the loop regions. Finally, the residues
satisfying such conditions simultaneously might be the critical residues
for a protein fold, and work as an efficient fingerprint for the
detection of homologous sequences. To evaluate these hypotheses,
this study identified the residues that satisfy the above
assumptions for various protein folds and examined their efficiencies
as a fingerprint.

We begin by describing the general scheme for designing fingerprints
using the devised approach. The approach is first implemented
on the Immunoglobulin V-set domain (IgV) as a model system
to present the detailed procedure. Next, the method is benchmarked
by applying it to various protein folds, such as beta-strand rich, alpha
+ beta, and alpha/beta protein folds, with a range of sequence similarities.
These studies demonstrate that the proposed approach is
effective in extracting an efficient fingerprint with high specificity and
sensitivity. The implications of our results for protein homology
detection are also discussed.

Results 

Design of OCR-based fingerprints. Figure 1 shows the scheme of
protein fingerprint mining based on the devised strategy. In the first
step, the residues conserved in the three aspects, i.e. sequence, structure,
and intramolecular interaction, were identified independently
from a group of homologous sequences for a specific fold. To identify
the residues conserved at the sequence level, a general multiple sequence
alignment (MSA) was performed using ClustalW 21. Structure-based
alignment (SBA) was applied to the target sequences to identify
structurally conserved residues using the Dali server 22. For the intramolecular
interactions, this study focused on the non-local hydrogen
bonds between beta-strands because they are considered one
of the most important factors determining protein fold and
stability 23. In addition, their patterns can be identified more clearly
than those of other intramolecular interactions. To select the
residues conserved in the hydrogen bond patterns of the beta-strands,
a method that aligns the beta-strand sequences based on the interstrand
hydrogen bond patterns of the β-sheet was employed (this
method will be referred to as the SSS-based approach because it
was devised to find the supersecondary structure (SSS)-determining
residues) 24. In this study, hydrophobicity and
hydrophilicity were used as the criteria for the conservedness of a
position to maximize the number of conserved positions in the
alignments. In the second step, the amino acid positions found to
be commonly conserved among the three different alignments were
selected. These residues were called "Overlapped Conserved Residues"
(OCR) and used to create the OCR fingerprint for the fold detection
process. In addition, the OCR embedded in the beta-strand region
were used to generate the OCR s fingerprint. Further, the OCR MIN
fingerprint was produced by eliminating the conserved positions in
the OCR s fingerprint one by one. The OCR-based fingerprints, i.e.
OCR, OCR s, and OCR MIN, were used to detect the homologous
proteins for a target fold, and their fold detection efficiencies were
compared with those of the fingerprints obtained by the MSA, SBA and SSS-based
approaches.
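
The two steps described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the hydrophobic residue set, the function names, and the assumption that conserved positions from all three alignments have already been mapped onto one shared reference numbering are all illustrative choices made here.

```python
from typing import List, Set

# Assumed grouping for illustration only; the paper does not list its exact classes.
HYDROPHOBIC = set("AVLIMFWC")


def conserved_columns(alignment: List[str]) -> Set[int]:
    """Return 0-based columns whose residues are all hydrophobic or all hydrophilic.

    Columns containing a gap ('-') are treated as not conserved.
    """
    conserved = set()
    for col in range(len(alignment[0])):
        column = [seq[col] for seq in alignment]
        if "-" in column:
            continue
        classes = {res in HYDROPHOBIC for res in column}
        if len(classes) == 1:  # every residue falls in the same class
            conserved.add(col)
    return conserved


def overlapped_conserved(msa: Set[int], sba: Set[int], sss: Set[int]) -> Set[int]:
    """OCR: positions conserved in all three alignments (shared numbering assumed)."""
    return msa & sba & sss


def strand_subset(ocr: Set[int], strand_positions: Set[int]) -> Set[int]:
    """OCR s: the OCR positions that lie in beta-strand regions."""
    return ocr & strand_positions
```

Representing each fingerprint as a set of reference positions keeps the overlap step a plain set intersection; mapping alignment columns onto the reference sequence is left out of this sketch.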
Figure 1 | Scheme of protein fingerprint mining. The flow chart shows the steps to extract the various OCR fingerprints. First, three independent alignment
methods, i.e. MSA, SBA and the SSS-based method, were applied to the target folds using hydrophobicity and hydrophilicity as conservedness criteria,
and the conserved fingerprint from each method was obtained. Second, the overlapped conserved residues in the three alignments were identified to generate the
OCR fingerprints. Further, elimination of the non-essential residues in the OCR fingerprint generates the OCR s and OCR MIN fingerprints.
Figure 2 | Conserved sequence residues obtained by the MSA, SBA and SSS methods. Protein sequence patterns for the Immunoglobulin V-set domain were
obtained by the MSA, SBA, SSS and OCR approaches. The distribution of conserved positions in secondary structure elements (SSEs) is shown for each alignment
method. Each sequence pattern is a PROSITE-like pattern. Here, in the expression "x(d,r)", "d" is the minimum number of residues between two
consecutive conserved positions and "r" is the maximum number of residues between two consecutive conserved positions. Similarly, the
expression "x" is used if the minimum and maximum distances between two consecutive conserved positions are the same.
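
A PROSITE-like pattern of this kind can be turned into a regular expression for scanning sequences. The sketch below is an illustration under assumptions, not the tool used in the study: the function name is hypothetical, and it accepts the three gap forms that appear in the patterns of this paper, namely "x", "x(n)" and "x(d,r)".

```python
import re


def prosite_to_regex(pattern: str) -> str:
    """Convert a PROSITE-like pattern such as '[VA]-x(3,7)-W' to a Python regex."""
    parts = []
    # A leading or trailing '-' (e.g. from a truncated pattern) is ignored.
    for element in pattern.strip("-").split("-"):
        gap = re.fullmatch(r"x(?:\((\d+)(?:,(\d+))?\))?", element)
        if gap is None:
            parts.append(element)  # a single residue or a class like [LIV]
        elif gap.group(1) is None:
            parts.append(".")  # bare 'x': exactly one arbitrary residue
        elif gap.group(2) is None:
            parts.append(".{%s}" % gap.group(1))  # x(n): exactly n residues
        else:
            parts.append(".{%s,%s}" % (gap.group(1), gap.group(2)))  # x(d,r)
    return "".join(parts)


regex = prosite_to_regex("[VA]-x(3,7)-[LIV]-x-W-x(7)-E")
match = re.search(regex, "GGAKKKLQWTTTTTTTE")  # toy sequence, not a real protein
```

Because the gaps are bounded, the resulting expression stays a plain regular expression that the standard `re` module can search without extra dependencies.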



Implementation of OCR-based approach on Immunoglobulin V-set
domain. In the first phase of this study, the OCR-based approach
was implemented as a model system on the antibody variable
domain-like proteins (IgV-set domain). The "IgV-set domain proteins"
have a beta-sandwich structure in which ten strands are arranged
in two β-sheets in a Greek-key fashion 25, and the lowest sequence
identity between two structural homologs is ~23%. The Protein
Data Bank contains approximately 558 IgV-set domains, and the
sequence length of the structures varies from 110 to 130 amino acid
residues. This study illustrates how to identify the critical residues
embedded in the beta-strands of the IgV-set domain using the OCR-based
approach, and examines their efficiency as a protein signature to detect
remote protein homologs. The fold detection efficiency
is a term that considers both detection sensitivity and specificity;
the exact definitions are described in the Methods section.

i) Homology detection efficiencies of MSA, SBA and SSS-based fingerprints.
To create a protein sequence pattern for the IgV-set domain, 10
distantly related protein sequences of IgV-set domains were selected
(Supplementary Table S1 online). The conserved sequence patterns
were created using three independent alignment methods,
i.e. the MSA, SBA and SSS-based methods. Figure 2 shows the sequence
patterns generated from each sequence alignment method. The
sequence patterns consisted of 43, 40 and 32% of the total residue
numbers for the MSA, SBA and SSS-based methods, respectively. The
sequence patterns were tested for detection of homologous protein
structures using the protein structure database, PDB, as the target
database. Table 1 lists the homology detection efficiencies of the
MSA, SBA and SSS-based fingerprints, which were 44, 51 and 76%, respectively.
The conserved sequence patterns determined by these three
methods were highly specific, with zero false positives.
These results suggest that the specificities of the fingerprints are
perfect, but that the sensitivities of the identified
conserved sequence patterns are limited.
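
A database scan of this kind can be tallied as in the following sketch. It is only an illustration: the function and argument names are hypothetical, the database is a toy dictionary, and the sensitivity shown is the standard TP / (TP + FN) ratio. The paper's own fold detection efficiency (EFF) is defined in its Methods section and is not reproduced here.

```python
import re
from typing import Dict, Set


def scan_database(regex: str, database: Dict[str, str],
                  known_members: Set[str]) -> Dict[str, float]:
    """Scan sequences with a regex fingerprint and count hits against a known fold set."""
    hits = {name for name, seq in database.items() if re.search(regex, seq)}
    tp = len(hits & known_members)   # known fold members that were detected
    fp = len(hits - known_members)   # detected sequences outside the fold
    fn = len(known_members - hits)   # known fold members that were missed
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    return {"hits": len(hits), "TP": tp, "FP": fp, "sensitivity": sensitivity}


# Toy data for illustration; names and sequences are made up.
toy_db = {"a": "AAWEE", "b": "WE", "c": "KKK"}
result = scan_database("WE", toy_db, known_members={"a", "c"})
```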

ii) Homology detection efficiency of OCR-based approach. The common
positions among the conserved positions identified in the three
sequence alignments were used to develop the OCR fingerprint. The
OCR fingerprint, shown in Figure 2, consists of 23% of the total
residue numbers, which is almost 25 to 50% shorter than
the previous three fingerprints. The fold detection efficiency of the
OCR fingerprint was 80%, higher than the fold detection efficiencies
of the MSA, SBA and SSS-based fingerprints, and there were no false
positives (Table 1). These results suggest that the sensitivity of the
OCR-based fingerprint for homology detection can be higher than that of
the three individual methods while maintaining perfect specificity,
despite the significant decrease in fingerprint size. This also provides
an important insight: some non-essential residues in the MSA,
SBA and SSS-based fingerprints can be eliminated, while the critical
residues are retained, during the extraction of the overlapped
conserved residues.

iii) Homology detection efficiency of the OCR-fingerprint in beta-strands.
To test the importance and efficiency of the fingerprints in
the secondary structures, a new fingerprint was generated by selecting
the conserved residues in the beta-strands of the IgV-set domain.
The new fingerprint, designated OCR s, consisted of just 12% of the
sequence residues, and its pattern length was just half that of the OCR
fingerprint. As shown in Table 1, the fold detection efficiency of the
OCR s fingerprint was improved to 87% compared to the 80% efficiency
of the original OCR fingerprint. The specificity of this fingerprint
was also perfect. These results suggest that the OCR residues in
the loop regions may be mostly non-essential residues that are
mainly responsible for the decrease in the fold scan sensitivity.
Therefore, the removal of these non-essential residues can improve
the fold detection efficiency. This also suggests that the OCR residues
in the beta-strands include the critical residues needed to detect the homologous
proteins efficiently. Overall, the beta-strand embedded amino
acids that are conserved in terms of sequence, structure, and
hydrogen bond pattern can be a very efficient fingerprint for a protein
fold.



Table 1 | Database scan results for Immunoglobulin V-set domain proteins
The table lists the percentage of the sequence residues involved in the generated fingerprints as
well as the detection efficiencies of the respective fingerprints for the IgV-set domain. Here, #res
indicates the total number of conserved positions and %res indicates the percentage of conserved positions for
each fold. Similarly, #Hits, TP, FP and EFF indicate the total structural hits, true positive hits, false
positive hits and fold detection efficiency, respectively.
iv) Minimization of the fingerprint size embedded in the beta-strands.
In the above studies, the OCR s fingerprint, composed of conserved
amino acids in the beta-strand regions amounting to just 12% of the
sequence residues, could be used to
detect the homologous proteins of the IgV-set domain quite efficiently,
whereas the OCR residues in the loop regions were not
essential for detecting the structural fold. The next question was
whether further non-essential residues were included in the identified
OCR s fingerprint and whether their elimination could improve
the efficiency of the OCR s fingerprint further. To examine this possibility,
an attempt was made to reduce the number of conserved
residues in the OCR s fingerprint, which represents the protein signature, by
eliminating the conserved positions individually and investigating
the efficiency of the reduced fingerprints. Generally, a further reduction
in the sequence pattern length resulted in an increase in the fold
scan sensitivity, but at the same time, the number of false positive
hits increased severalfold, resulting in an overall decrease
in the fold scan efficiency (Supplementary Table S2 online). On the
other hand, two exceptions were observed, in which elimination of
one of the hydrophobic conserved positions, i.e. either F s37 or V s77 in 1g6vK,
improved the fold scan efficiency compared to the efficiency of
OCR s. For example, OCR s without the conserved F s37 residue, which
is designated as OCR MIN in Table 1, showed 95% fold detection
efficiency despite the detection of some false positives. Eliminating
both hydrophobic conserved positions together
decreased the fold detection efficiency significantly. Overall, the
sequence pattern length could be reduced by only 1 position with
an increase in the fold detection efficiency, and the number of false
positives increased as more conserved positions in OCR s were eliminated.
These results provide two insights. First, the OCR s fingerprint
for the IgV-set domain proteins may be composed of almost the
minimal set of critical residues, and is very close to OCR MIN, which determines
the similar structural fold quite efficiently. Second, further
elimination of the non-essential residues can enhance the fold detection
efficiency further, similar to the above studies.
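
The one-position-at-a-time elimination described above can be sketched as a simple leave-one-out search. This is a hedged illustration rather than the authors' procedure: the function name is hypothetical, and `efficiency` stands for any user-supplied scoring function (for example, one that runs a database scan and returns the fold detection efficiency), which is assumed here and replaced by a toy lookup.

```python
from typing import Callable, FrozenSet, Tuple

Fingerprint = FrozenSet[int]


def best_single_elimination(
    fingerprint: Fingerprint,
    efficiency: Callable[[Fingerprint], float],
) -> Tuple[Fingerprint, float]:
    """Drop each conserved position in turn and keep the best-scoring fingerprint.

    The original fingerprint is returned when no single removal improves the score.
    """
    best, best_score = fingerprint, efficiency(fingerprint)
    for pos in sorted(fingerprint):
        reduced = fingerprint - {pos}
        score = efficiency(reduced)
        if score > best_score:
            best, best_score = reduced, score
    return best, best_score


# Toy scoring table (made-up numbers) standing in for a real database scan.
toy_scores = {frozenset({1, 2, 3, 4}): 0.87, frozenset({1, 2, 4}): 0.95}
minimized, score = best_single_elimination(
    frozenset({1, 2, 3, 4}), lambda fp: toy_scores.get(fp, 0.5)
)
```

Calling the function repeatedly on its own output would extend the search to more than one removal, at the cost of one scan per candidate.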

Benchmarking the OCR-based approach on Dataset. The above 
results confirm that the OCR based approach can be a simple way of 
identifying the efficient fingerprint to detect protein homologs. Here, 
this study examined whether the OCR-based approach could be also 
used to identify such efficient fingerprints for other proteins with a 
range of folds and sequence similarities. Similar to the model study, 
two OCR-based fingerprints, i.e. OCR and OCR s , were generated for 
the various target folds, and their fold detection efficiencies were 
compared with the fingerprints created by the MSA, SBA and SSS- 
based approaches. This study also examined if the OCR s fingerprint 
was close to the minimal fingerprint to detect the structural fold. 

i) Selection of protein folds and generation of fingerprints. The datasets
consist of three different fold classes of proteins in the Structural
Classification of Proteins (SCOP) database, i.e. all-beta, α + β, and α/β.
Each fold class contained 4 structural folds, where the members in
each fold were structurally homologous with a range of sequence
identities. Each fold class had 2 representative structural folds at
low sequence identity and 2 representative structural folds at high
sequence identity. Each protein fold consisted of the protein members
of single or multiple protein families, and 10 representative
protein structures with the most sequence diversity were selected.
Table 2 lists the structural and sequence properties of the selected
protein folds. The conserved sequence patterns for the target folds in
Table 2 were generated using the MSA, SBA, SSS and OCR-based
approaches (Supplementary Figure S1 online).
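
The minimum sequence identity of a fold's representative set can be computed from a multiple alignment as in the sketch below. The identity definition used here (identical residues divided by the number of columns where neither sequence has a gap) is one common convention and is an assumption; the paper does not state which definition it used, and the function names are hypothetical.

```python
from itertools import combinations
from typing import List


def percent_identity(a: str, b: str) -> float:
    """Percent identity over aligned columns where neither sequence has a gap."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    return 100.0 * sum(x == y for x, y in pairs) / len(pairs)


def min_pairwise_identity(alignment: List[str]) -> float:
    """Lowest percent identity over all pairs of aligned sequences."""
    return min(percent_identity(a, b) for a, b in combinations(alignment, 2))
```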

ii) Homology detection. Homology detection was performed against
the PDB using the generated fingerprints, and their fold detection
efficiencies were compared. Table 3 lists the percentage of the
sequence residues involved in the generated fingerprints as well as
the detection efficiencies of the respective fingerprints for the target
folds. As shown in the results, the general trend of the fold detection
efficiency was similar to the result of the model protein study using
the IgV-set domain proteins. The OCR
fingerprints generally showed improved detection efficiency compared
to the MSA, SBA, and SSS-based fingerprints for most of the
target folds. The use of the OCR s fingerprint enhanced the detection
efficiency further. For example, in the cases of the cysteine proteinases
and pyruvate kinase N-terminal domain-like protein, a dramatic
change in efficiency was observed, where the fold detection
efficiency increased from 61% and 48% for the OCR fingerprints to 86%
and 94%, respectively, for the OCR s fingerprints.
The sizes of the respective OCR s fingerprints ranged from
6% to 17% of the total residue numbers of the target protein folds.
The maximum efficiency of the OCR-based fingerprints, either OCR
or OCR s, was in the range of 84%-100%, whereas the MSA, SBA and
SSS-based fingerprints showed relatively low and very different
detection efficiencies depending on the target folds.

In two exceptional cases, the fold detection efficiency of OCR was
higher than that of OCR s. In the cases of the Cupredoxin-like proteins
and 50S ribosomal protein L25-like proteins, the fold detection
efficiency of the OCR s fingerprints decreased significantly, from
91% and 97% to 17% and 35%, respectively, compared to their
OCR fingerprints. In these cases, a high number of false positives
was detected in the database scan (Supplementary Table S3 online).
The OCR in the loop regions of these two protein folds were presumed to
include some critical residues for homology detection, and the omission
of these critical residues from the OCR s fingerprints may result in a
substantial decrease in specificity.

iii) Minimization of the beta-strand embedded OCR fingerprint size.
These results suggest that the sizes of the OCR s fingerprints are only
5-15% of the total residue numbers of the target protein folds.
Interestingly, the fingerprint sizes of the protein folds with low or
high similarity were not so different. An attempt was made to identify
fingerprints with fewer positions by reducing the OCR s fingerprints
and examining their detection efficiencies. The OCR s fingerprints
for the target folds β-Grasp (ubiquitin-like) and Ribosomal
protein L25 presented the minimum-size sequence pattern, from which
no further conserved positions could be eliminated without
sacrificing the fold detection efficiency. For the other target folds,
the sequence pattern length could be reduced by at most
1-2 residues. These results suggest that the identified OCR s fingerprints
for the target folds are close to the minimum critical residues
needed to detect the target folds efficiently, as in the Immunoglobulin
V-set domain case. On the other hand, the use of the minimized
OCR s, i.e. OCR MIN, led to further enhancement of the detection
efficiency. Their detection efficiencies were approximately 90%
to 100% for most of the target folds (Table 3).

Overall, the fold detection study for the target dataset confirmed
the following three important outcomes of the model study. First, the
OCR-based approach showed very high fold detection efficiency for
the target folds. The fold detection efficiencies of the MSA, SBA and
SSS methods were relatively low and
differed from fold to fold. In contrast, the fingerprints obtained from
the OCR-based approach, i.e. the OCR, OCR s and
OCR MIN fingerprints, showed significantly improved efficiency, with
more than 90% fold detection efficiency at the maximum. Second,
reducing the fingerprint size using the OCR-based approach proved
to be efficient in eliminating the non-essential residues while retaining
the critical conserved residues. Third, the OCR s fingerprint was
almost the minimal fingerprint needed to detect the structural fold.

Properties of the OCR-based fingerprints embedded in beta-strands.
To determine if there were any common features of the
OCR-based fingerprints identified above, the residues comprising
the OCR s fingerprints were characterized in various aspects. No
specific features common to the target dataset were found in the
aspects of the side chain properties and their positional properties.
The identified residues showed irregular patterns in terms of their
polar and non-polar properties, and they were distributed unevenly
from the core to surface regions (data not shown).

Table 2 | Target dataset consisting of 12 protein folds with structurally similar but sequence-dissimilar protein sequences
The table lists the 12 protein folds with structurally similar but sequence-dissimilar protein sequences that are used as the target dataset. The fold class and title are listed in the first column; the second column shows the
sequence length of the representative structures of each fold. For each fold, secondary structure element information, i.e. the total number of residues involved in strands, helices and loops, is listed. Here, Min SEQ ID
indicates the minimum sequence identity among the sequences representing the particular fold.

On the other hand, an analysis of the distribution of the minimum
conserved positions revealed clustering of the conserved
positions across the entire sequence length. The sequence patterns
were cluster-like, with the conserved residues
grouped into several blocks separated by irregular gaps. For
example, as shown in Figure 3, the distribution of the overlapped
conserved residues for the Immunoglobulin V-set domain
showed five different clusters. Each cluster consisted of 2-3 amino
acids and the distance between the clusters varied. Figure S2
shows the clusters of the other target folds. The fingerprint for
each target fold contains 3-5 conserved residue clusters. Most of
the conserved residue clusters contained 3-5 identified positions,
but a cluster could be as long as 12 residues, as found in the
RNase A-like fold. The general length of the irregular gaps was
10-20 amino acids, but it could be more than 40 residues, as in
the case of the GFP-like protein.
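
The grouping of conserved positions into clusters separated by gaps can be made concrete with a short sketch. The `max_gap` threshold and the function name are assumptions introduced for illustration; the paper does not state the rule it used to delimit clusters.

```python
from typing import Iterable, List


def cluster_positions(positions: Iterable[int], max_gap: int = 2) -> List[List[int]]:
    """Group sorted positions; a new cluster starts when the step exceeds max_gap."""
    clusters: List[List[int]] = []
    for pos in sorted(positions):
        if clusters and pos - clusters[-1][-1] <= max_gap:
            clusters[-1].append(pos)  # close enough to the previous position
        else:
            clusters.append([pos])    # gap too large: open a new cluster
    return clusters


# Toy positions, not taken from any fingerprint in the paper.
example = cluster_positions([3, 4, 5, 20, 21, 40])
```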



Comparison of fold detection efficiency with traditional methods. 

The OCR s fingerprints in the above results proved to be
extremely effective in detecting the homologous structures. To check
the practical importance of the method, the fold detection efficiency
of the OCR-based approach was benchmarked
against traditional methods such as PSI-BLAST,
HMMER, HHpred and FASTA search, and the results are listed
in Table 4. The fold detection efficiency of PSI-BLAST was in the
range of 42% to 92%, varying with the fold type.
HMMER showed an improvement in fold detection efficiency, with
the detection of over 65% of protein homologs for each fold in the dataset,
except in the case of the β-Grasp (ubiquitin-like) fold, where it showed
just 39% fold detection efficiency. HHpred and FASTA search
showed a significant increase in fold detection efficiency, with the
detection of over 75% of sequence homologs for each fold. In some
cases, HHpred and FASTA search showed better fold detection
efficiency than the OCR-based approach. The results showed that
the fold detection efficiency of the fingerprints obtained using the
OCR-based approach is either competitive with or better than that of the
traditional approaches.
Discussion

A major concern in the design of ideal-like protein fingerprints is
how to improve their sensitivity for homology detection without
sacrificing their specificity. This suggests that the non-essential residues
that can decrease the sensitivity should be excluded from the
design while retaining the critical residues for a protein fold. This
study demonstrated that such a design is possible by extracting the
beta-strand embedded residues that are conserved in terms of
sequence, structure and hydrogen bonding pattern from a group of
related protein sequences. The OCR-based fingerprints were found
to be very efficient in detecting the homologous protein folds of
various classes, such as beta-strand rich, alpha + beta
and alpha/beta proteins, regardless of their sequence similarities. Our
results may provide an important conceptual improvement in the
design of an ideal fingerprint for a protein fold, which may
contribute to the understanding of the relationship between protein
sequences and structures.

In our study, the OCR-based approach was utilized to prepare the
fingerprints for protein folds that include beta-strands. In the case
of α-helix rich proteins, the OCR-based approach could not be
applied efficiently to define the critical residues owing to the lack of
consistent intramolecular interactions such as the hydrogen bonds
between beta-strands. Nevertheless, the importance of eliminating
non-essential residues in fold detection for α-helix rich proteins
was also confirmed. The OCR H fingerprint, consisting of the
overlapped conserved residues from the α-helical region, showed higher
fold detection efficiency than the fingerprints generated
by the MSA or SBA method alone. When an attempt was made
to reduce the fingerprint size by eliminating the overlapped conserved
positions individually, the efficiencies improved gradually,
and the minimum fingerprints, OCR MIN, were quite sensitive and
specific in identifying the structural folds. Supplementary Tables S4 and
S5 list the descriptions of the α-helix rich target folds and the fold detection
efficiencies of the various fingerprints for these folds.

The sizes of the OCR s fingerprints were only 5-15% of the target
protein, but these small fingerprints were sufficient to detect the
sequences for a given fold with perfect specificity, regardless of the
protein fold and sequence similarity. What accounts for the high specificity
of these small fingerprints? The overlapped conserved residues
across the sequence length formed a small set of clusters of
neighboring or consecutive amino acids, each resembling a
local sequence motif (Figure 3 and Supplementary Figure S2 online).
Any disturbance to these clusters, while searching for
the minimum crucial positions for the target folds, decreased the fold
detection specificity significantly (Supplementary Table S2 online).
We presume that the high specificity of the OCR-based fingerprints
is due to the presence of these clustered sequence motifs in the
pattern, despite their small size.

In Table 4, the fold detection efficiency of the OCR-based
approach was compared with that of the traditional methods, demonstrating
that the OCR-based approach was quite competitive or even
showed higher efficiency compared to the other methods. In fact, the
OCR-based approach and the traditional methods follow different
algorithms in the detection of homologous proteins. Therefore, such a
direct comparison may not be a perfectly legitimate way to evaluate the
performance of the methods. However, the comparison provides
the insight that the OCR-based approach can be very useful for detecting
protein homology.

In our study, OCR-based sequence patterns could detect all or
most of the known structural homologs of a protein in the protein
structure database. In particular, the database scan using the OCR-based
patterns was confirmed to be efficient also in the detection of remote
homologous proteins. For example, the OCR-based pattern developed
using the 10 representative GFP-like sequences successfully identified
the domain G2 of Nidogen-1 (PDB ID: 1GL4 and 1H4U) as
a homolog in our study (Supplementary Table S10 online). In fact, it
is not easy to identify such a relationship because of the low sequence
similarity between the proteins. Fold detection using the protein
sequence of avGFP or other GFP variants by the traditional
approaches such as PSI-BLAST, HMMER, HHpred and FASTA
search was unable to identify domain G2 of Nidogen-1 as a structural
homolog (Supplementary Tables S10, S11, S12, S13 and S14 online).
The relationship could be identified only after the structure of mouse
nidogen globular fragment 2 was solved using X-ray crystallography 26.
Further, to check the possibility that novel homologous
proteins can be identified using the OCR fingerprints, we attempted
to perform the fold scan against a larger database, the NCBI
non-redundant (nr) protein sequence database. We expected that
fold detection against the sequence database would provide more
sequence hits that might not be well studied because of the lack of
any structural or functional annotation. Identification of such
remote homologous proteins was quite successful. For instance, several
sequences with no significant sequence similarity were identified
using the OCR-based pattern for Cupredoxin-like proteins. The
accession numbers of the identified sequences were WP_010687666,
WP_019121393, WP_021320206, WP_004263537,
WP_008217106, WP_019379850, etc. The identified sequences
share around 15-24% sequence similarity with the representative
Cupredoxin-like protein (Supplementary Figure S15
online). Tertiary structures of the identified sequences were modeled
successfully, which showed that the sequences are homologous to the
Cupredoxin-like proteins (more details about these results will be
presented elsewhere). The identified sequences were also annotated
as Cupredoxin-like proteins in the NCBI sequence database while
we were preparing this report, which also confirmed our results.
Although we focused on demonstrating the characterization and 
efficiency of OCR-based approach in this report, these results implic- 
ate that the OCR-based approach can be an efficient tool in the search 
of novel homologous proteins for a specific target fold. We also
expect that the OCR-based approach/fingerprints can be combined with
other efficient algorithms or databases such as PROSITE, which may
generate much more efficient sequence patterns to characterize
protein sequences and structures.

Amino Acid Sequence

[VA]-x(3,7)-[LIV]-[ST]-[CI]-x(12,15)-W-[FVI]-[RQ]-x(7)-E-x-[VL]-x(15,24)-[ST]-x(6,9)-[VLF]-x-[IL]-x(8,10)-

Figure 3 | Distribution of OCR_S across the protein sequence. Conserved positions in the OCR_S fingerprint of the Immunoglobulin V-set domain are plotted
across the entire protein sequence length for easy visualization. The figure shows that the conserved positions are not distributed evenly but occur as multiple
conserved blocks.

Methods 

Selection of protein folds. In the present study, evolutionarily related protein folds
were derived from the Structural Classification of Proteins (SCOP) database 27 . Three
β-strand rich protein fold classes, i.e. all-beta, alpha + beta (α + β) and alpha/beta (α/β),
were used. The protein folds in each class and the protein structures of a particular fold
were selected according to the following criteria:

1. Protein structures have been shown to be more conserved than sequences
during evolution. Protein sequences representing a particular protein fold
within a superfamily can be either highly similar (sequence homologs)
or dissimilar (remote homologs). Therefore, in the dataset, two
structural folds consisting of homologous proteins with high sequence identity
(around 30% or more) and two structural folds consisting of homologous
proteins with low sequence identity (20% or less) were selected to identify the
conserved sequence patterns.

2. For each structural fold, 10 representative protein sequences within a
superfamily were selected such that no two sequences share >90% sequence
identity. The sequence pattern generated from such sequences is expected to
serve as a fingerprint for a wider range of sequences of the fold.

3. Protein family members that are structurally similar but dissimilar in
sequence, or that lack one or two α-helices or β-strands, were included in this
study. They represent cases of evolutionary pressure in which the structure
remains fully or mostly intact regardless of sequence change.

4. Protein structural folds with different sizes, i.e. sequence length from 80 to 260 
amino acids, were selected. 

5. Low-resolution protein structures, i.e. those with a resolution worse than
2.5 Å, were eliminated from the selection.



Alignment of the sequences and mining of the conserved sequence pattern. Three
sequence alignment methods were used: multiple sequence alignment (MSA) by
ClustalW 21 , structure-based alignment (SBA) by the Dali server 22 , and SSS-based
alignment. Each alignment was performed for each fold using the ten
representative protein sequences and/or structures. In the present study, amino
acid properties, namely hydrophobicity and hydrophilicity, were used as the criterion
for conservation of a position in order to maximize the number of
conserved positions in the alignments. A conserved position was defined as a
position of the alignment occupied by only hydrophobic or only hydrophilic
residues. The amino acid residues V, I, L, M, F, W, C, A
and Y are interchangeable at hydrophobic conserved positions, whereas residues
Q, N, E, D, R, K, H, T, S, G and P are interchangeable at hydrophilic conserved
positions.
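The conservation criterion above can be sketched in a few lines of Python. This is only an illustration: the function names, the toy alignment and the treatment of gap characters (a gapped column is counted as not conserved) are assumptions, since the text does not specify how gaps were handled.

```python
# Residue classes exactly as listed in the Methods text.
HYDROPHOBIC = set("VILMFWCAY")
HYDROPHILIC = set("QNEDRKHTSGP")


def classify_column(column):
    """Return 'hydrophobic', 'hydrophilic', or None for one alignment column.

    Assumption: a column containing a gap ('-') or any other symbol is
    treated as not conserved.
    """
    residues = set(column.upper())
    if residues <= HYDROPHOBIC:
        return "hydrophobic"
    if residues <= HYDROPHILIC:
        return "hydrophilic"
    return None


def conserved_positions(aligned_seqs):
    """Map 0-based column index -> class for every conserved column."""
    if len({len(s) for s in aligned_seqs}) != 1:
        raise ValueError("aligned sequences must have equal length")
    result = {}
    for i, column in enumerate(zip(*aligned_seqs)):
        label = classify_column("".join(column))
        if label is not None:
            result[i] = label
    return result


# Toy alignment (hypothetical sequences, not taken from the paper's dataset).
toy_alignment = ["VKTA-", "IRSLG", "LQTWG"]
print(conserved_positions(toy_alignment))
```

In the toy alignment the last column contains a gap and is therefore skipped, while the first four columns are each purely hydrophobic or purely hydrophilic.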

Multiple sequence alignment was performed with the ClustalW web server for the 10
representative protein sequences using the default parameters. Multiple structure
alignment was performed using the DALI server, which searches an input query
structure against the database of known structures (PDB) and returns
a list of structural neighbors 28 . The protein structures corresponding to the 10
representative protein sequences used for MSA were then selected, and the automated
structural alignment option was used to perform the multiple structure alignment.
Further, the conserved positions in both alignments were redefined based on the
hydrophobicity and hydrophilicity criterion. In the case of SSS, the alignment was
performed separately for each strand and loop rather than for the entire sequence. The
alignment in the strands was performed using the inter-strand hydrogen (H)-bonds.
The alignment of the residues in the loop regions was performed manually using the
physical properties of the amino acids. From the resulting alignments, conserved
residue positions were identified, and a conserved sequence pattern was obtained
for each alignment method.

Overlapped Conserved Residues (OCR) and homologous fold detection. To
identify the critical residues conserved simultaneously in three aspects, i.e. sequence,
structure, and intramolecular interaction, the above three independent alignment
methods were performed for each target fold, and the positions common to all three
sets of conserved positions were extracted; these are called the Overlapped
Conserved Residues (OCRs). The OCRs were used to generate an OCR-fingerprint.
Similarly, the OCR_S fingerprint was obtained using only the overlapped residues
embedded in the strand region. The syntax of the OCR-fingerprints follows that of
PROSITE patterns. Therefore, they could be used directly for fold detection against
the structure database.
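The overlap step and the conversion to a PROSITE-style string can be illustrated with a minimal Python sketch. It is not the authors' procedure (which is described in the Supplementary Information): the function names and input layout are hypothetical, the three position sets are assumed to be already mapped onto one shared numbering, and each pattern position simply lists the residues observed in the representative sequences.

```python
def overlapped_conserved(msa_cols, sba_cols, sss_cols):
    """Positions conserved in all three alignments, in ascending order.

    Assumption: the three sets use one shared numbering scheme
    (for example, residue numbers of a reference protein).
    """
    return sorted(set(msa_cols) & set(sba_cols) & set(sss_cols))


def build_pattern(ocr_residue_indices, sequences):
    """Build a PROSITE-style pattern string from OCR positions.

    ocr_residue_indices[k][s] is the 0-based index, in sequences[s], of the
    residue occupying the k-th OCR position (hypothetical input layout).
    """
    parts = []
    for k, indices in enumerate(ocr_residue_indices):
        if k > 0:
            previous = ocr_residue_indices[k - 1]
            # Number of residues between consecutive OCRs in each sequence.
            gaps = [cur - before - 1 for cur, before in zip(indices, previous)]
            lo, hi = min(gaps), max(gaps)
            if hi > 0:
                if lo == hi:
                    parts.append("x" if lo == 1 else f"x({lo})")
                else:
                    parts.append(f"x({lo},{hi})")
        observed = sorted({seq[i] for seq, i in zip(sequences, indices)})
        if len(observed) == 1:
            parts.append(observed[0])
        else:
            parts.append("[" + "".join(observed) + "]")
    return "-".join(parts)


# Toy data (hypothetical): two sequences and four OCR positions.
toy_sequences = ["AVKKLSC", "GAKLTI"]
toy_indices = [[1, 1], [4, 3], [5, 4], [6, 5]]
print(overlapped_conserved({3, 7, 12, 20}, {3, 7, 15, 20}, {1, 3, 20}))
print(build_pattern(toy_indices, toy_sequences))
```

Variable spacers such as x(1,2) arise here because the number of residues between two consecutive OCRs differs among the representative sequences.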

The standalone version of the ExPASy ScanProsite tool was used for fold detection
with various sequence patterns as input 29 . Over 78,000 protein sequences from the
PDB were downloaded and used as the input for the ScanProsite tool. Fold detection
using the specific sequence patterns against the structure database was performed.
The step-by-step process to obtain the OCR-based fingerprint is detailed in the
Supplementary Information (Supplementary Text and Supplementary Tables S6, S7,
S8 and S9 online). Based on the search results, proteins were classified as 'True
Positives, TP', 'False Negatives, FN' and 'False Positives, FP'. Identified
structural hits (proteins) that are members of the same superfamily as the
representative proteins used to generate the pattern for the fold are defined as 'true
positives', whereas members of the superfamily that are not identified by the
sequence pattern in fold detection are defined as 'false negatives'. Further, the
identified hits that do not belong to the superfamily under consideration are defined
as 'false positives'.
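The scan and the TP/FP/FN classification can be mimicked with a small Python sketch. It does not reproduce ScanProsite: the pattern is translated into a regular expression for Python's `re` module, only the basic PROSITE elements are supported, and the function names and toy database are hypothetical.

```python
import re

# One pattern element: optional '<', a residue / x / [..] / {..},
# an optional repeat such as (3) or (3,7), and an optional '>'.
_ELEMENT = re.compile(
    r"(<?)(x|[A-Z]|\[[A-Z]+\]|\{[A-Z]+\})(?:\((\d+)(?:,(\d+))?\))?(>?)"
)


def prosite_to_regex(pattern):
    """Translate a PROSITE-style pattern into a Python regular expression."""
    regex = []
    for element in pattern.strip().rstrip(".").split("-"):
        element = element.strip()
        if not element:
            continue  # tolerate a trailing '-' left by a wrapped pattern
        match = _ELEMENT.fullmatch(element)
        if match is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        start, core, lo, hi, end = match.groups()
        if core == "x":
            core = "."  # any residue
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"  # excluded residues
        repeat = ""
        if lo is not None:
            repeat = "{" + lo + ("," + hi if hi is not None else "") + "}"
        regex.append(("^" if start else "") + core + repeat + ("$" if end else ""))
    return "".join(regex)


def classify_hits(pattern, database, superfamily_ids):
    """Return (TP, FP, FN) as sets of sequence identifiers.

    database maps identifier -> sequence; superfamily_ids holds the
    identifiers known to belong to the target superfamily.
    """
    compiled = re.compile(prosite_to_regex(pattern))
    hits = {sid for sid, seq in database.items() if compiled.search(seq)}
    members = set(superfamily_ids) & set(database)
    return hits & members, hits - members, members - hits


# Toy database (hypothetical identifiers and sequences).
toy_db = {
    "fam1": "MAVKKLSCG",
    "fam2": "MGAKLTIQ",
    "fam3": "MPPPPLSCG",
    "other1": "QQAVKKLTCEE",
    "other2": "GGGGGG",
}
tp, fp, fn = classify_hits("[AV]-x(1,2)-L-[ST]-[CI]", toy_db, {"fam1", "fam2", "fam3"})
print(sorted(tp), sorted(fp), sorted(fn))
```

In this toy run two superfamily members are matched, one is missed, and one unrelated sequence is matched by chance, which gives one example of each class.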

The effectiveness of an OCR-based pattern is determined in terms of
"sensitivity" and "specificity". A fingerprint is defined as highly specific if it
detects only 'true positives' and no or minimal 'false positives'. "Specificity" is
calculated as the ratio of 'true positives' to the total of 'true positives' and 'false
positives'.



\[
\mathrm{Specificity}(\%) = \frac{TP}{TP + FP} \times 100 \qquad (1)
\]



A sequence pattern is highly sensitive if it detects all or most of the structure 
homologs. "Sensitivity" is calculated as the ratio of 'true positives' hits to the total 
number of structure homologs in PDB. 



\[
\mathrm{Sensitivity}(\%) = \frac{TP}{TP + FN} \times 100 \qquad (2)
\]



A sequence pattern is highly efficient if it detects all or most of the homologous
proteins ('true positives') and no or minimal 'false positives'. "Efficiency" is calculated
as the ratio of 'true positives' hits to the total number of hits.

\[
\mathrm{Efficiency}(\%) = \frac{TP}{TP + FP + FN} \times 100 \qquad (3)
\]

If FP is zero or low,

\[
\mathrm{Efficiency} \approx \mathrm{Sensitivity} \qquad (4)
\]



The fold detection efficiency of the OCR-fingerprints was determined and compared
with the efficiency of the three independent alignment methods.
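The three measures can be computed from the hit counts as in the following minimal Python sketch. The function name and the example counts are hypothetical. The efficiency denominator TP + FP + FN is an assumption, chosen because it makes efficiency reduce to sensitivity when FP is zero, as stated in Eq. (4).

```python
def detection_metrics(tp, fp, fn):
    """Return (specificity, sensitivity, efficiency) as percentages.

    Specificity and sensitivity follow Eqs. (1) and (2). The efficiency
    denominator (TP + FP + FN) is an assumption consistent with Eq. (4).
    """
    if tp + fp == 0 or tp + fn == 0:
        raise ValueError("need at least one hit and one known homolog")
    specificity = 100.0 * tp / (tp + fp)
    sensitivity = 100.0 * tp / (tp + fn)
    efficiency = 100.0 * tp / (tp + fp + fn)
    return specificity, sensitivity, efficiency


# Hypothetical counts: 45 homologs detected, 5 missed, no false positives.
print(detection_metrics(45, 0, 5))  # (100.0, 90.0, 90.0)
```

With no false positives the efficiency equals the sensitivity, which is the situation described by Eq. (4).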

Benchmarking of OCR-based approach against the target dataset. The fold detection
efficiency of the OCR-based approach was tested against the target dataset,
consisting of 12 protein folds in 3 different structural classes in SCOP, to benchmark
the approach. For each fold, the MSA, SBA, SSS, OCR, OCR_S and
OCR_MIN fingerprints were obtained and fold detection against the PDB was performed.
The fold detection efficiencies of the fingerprints were listed and compared.

The fold detection efficiency of the OCR-based approach was compared with the fold
detection efficiencies of traditional methods such as PSI-BLAST, HMMER,
HHpred and FASTA search 30-33 . Fold detection using PSI-BLAST and FASTA search
was performed using one representative protein sequence for each fold against the
Protein Data Bank. HMMER, with the default significance E-values, was used to
detect homologous protein sequences in the protein structure database. Similarly,
HHpred, a homology search tool based on HMM-HMM comparison, was used for
homology detection, using one representative protein sequence for each fold, against
the manually uploaded PDB sequence database. The fold detection efficiencies of the
OCR-fingerprints and of PSI-BLAST, HMMER, HHpred and FASTA search were listed
and compared.